
ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4 #8949

Closed
liyafan82 wants to merge 10 commits from the fly_1211_comp branch

Conversation

liyafan82
Contributor

Support compressing/decompressing RecordBatch IPC buffers by LZ4.


// first 8 bytes reserved for uncompressed length, to be consistent with the
// C++ implementation.
ArrowBuf compressedBuffer = allocator.buffer(maxCompressedLength + SIZE_OF_MESSAGE_LENGTH);
compressedBuffer.setLong(0, unCompressedBuffer.writerIndex());
Member

Do we need to ensure this in little-endian? (c.f. https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc#L385).

Contributor Author

Revised accordingly. Thanks for your kind reminder.
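For illustration, the little-endian concern can be sketched in plain Java with a `ByteBuffer` (the helper name `writeUncompressedLength` is hypothetical; the actual PR writes through `ArrowBuf`, but the byte-order requirement is the same):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LengthPrefix {
  static final int SIZE_OF_MESSAGE_LENGTH = 8;

  // Write the uncompressed length into the first 8 bytes in little-endian,
  // matching how the C++ IPC reader decodes the prefix.
  static void writeUncompressedLength(ByteBuffer buf, long uncompressedLength) {
    buf.order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(0, uncompressedLength);
  }
}
```

Without forcing `ByteOrder.LITTLE_ENDIAN`, `ByteBuffer` defaults to big-endian, which is exactly the bug the reviewer is pointing at.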

decompressor = factory.fastDecompressor();
}

long decompressedLength = compressedBuffer.getLong(0);
Member

ditto

Contributor Author

Revised. Thank you.

@kiszk
Member

kiszk commented Jan 4, 2021

Looks good to me

import org.apache.arrow.memory.util.MemoryUtil;
import org.apache.arrow.util.Preconditions;

import net.jpountz.lz4.LZ4Compressor;
Contributor

How was this library chosen? It looks like it might not have been released in a while?

Member

My guess is that this import refers to this.

Contributor Author

@kiszk You are right. I chose this library because our C++ implementation also depends on this repo (https://github.com/lz4/lz4).

}

@Override
public ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer) {
Contributor

Suggested change
public ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer) {
public ArrowBuf compress(BufferAllocator allocator, ArrowBuf uncompressedBuffer) {

Contributor

? or is this consistent with the existing API?

Contributor Author

Nice catch. Thank you!

Preconditions.checkArgument(unCompressedBuffer.writerIndex() <= Integer.MAX_VALUE,
"The uncompressed buffer size exceeds the integer limit");

// create compressor lazily
Contributor

why?

Contributor Author

For some scenarios (e.g. flight sender), we only need the compressor, while for others (e.g. flight receiver), we only need the decompressor. So there is no need to create both eagerly.
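The lazy-creation rationale can be sketched with a small generic holder (a minimal sketch; the `Lazy` class and `Supplier` wiring are illustrative, not the PR's actual code, which initializes the lz4 compressor/decompressor fields directly on first use):

```java
import java.util.function.Supplier;

/**
 * Illustrative sketch of lazy initialization: the expensive object is
 * created on first use only, so a sender never builds a decompressor
 * and a receiver never builds a compressor.
 */
final class Lazy<T> {
  private final Supplier<T> factory;
  private T value;

  Lazy(Supplier<T> factory) {
    this.factory = factory;
  }

  synchronized T get() {
    if (value == null) {
      value = factory.get();  // created lazily, at most once
    }
    return value;
  }
}
```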

@emkornfield
Contributor

Is it possible to add a test to confirm that this can be read/written from the C++ implementation?

@HedgehogCode
Contributor

HedgehogCode commented Jan 5, 2021

When I use the changes and try to compress and decompress an empty buffer (by using a variable sized vector with only missing values) I get a SIGSEGV (hs_err_pid10504.log):

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f1209448fd0, pid=10504, tid=0x00007f1208bc7700
#
# JRE version: OpenJDK Runtime Environment (8.0_275-b01) (build 1.8.0_275-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.275-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6fdfd0]  jni_ThrowNew+0xc0
#
# Core dump written. Default location: /workspaces/arrow/java/vector/core or core.10504

This can be reproduced by adding the following test to TestCompressionCodec.java:

  @Test
  public void testEmptyBuffer() throws Exception {
    final int vecLength = 10;
    final VarBinaryVector origVec = new VarBinaryVector("vec", allocator);

    origVec.allocateNew(vecLength);

    // Do not set any values (all missing)
    origVec.setValueCount(vecLength);

    final List<ArrowBuf> origBuffers = origVec.getFieldBuffers();
    final List<ArrowBuf> compressedBuffers = compressBuffers(origBuffers);
    final List<ArrowBuf> decompressedBuffers = deCompressBuffers(compressedBuffers);

    // TODO assert that the decompressed buffers are correct
    AutoCloseables.close(decompressedBuffers);
  }

This looks like an error in the lz4-java library but I am not sure. I thought I should mention it here first.
(Note that I am using OpenJDK 8 and I haven't tried OpenJDK 11 yet)

@liyafan82
Contributor Author

Is it possible to add a test to confirm that this can be read/written from the C++ implementation?

@emkornfield I think it is a good idea to provide e2e cross-language integration tests.
However, I am not sure if we are ready now.

In particular, we need to change the way buffers are released after compressing.
Previously, we released the buffers by directly closing the related vectors. This no longer works, as the vectors' buffers are now released by the codec, and the compressed buffers need to be released properly.

Solving this may impact other parts of the code base, so maybe we need another issue to discuss it (if we do not address it in this PR).

@liyafan82
Contributor Author

When I use the changes and try to compress and decompress an empty buffer (by using a variable sized vector with only missing values) I get a SIGSEGV […]

@HedgehogCode Thanks a lot for your effort and information. I will take a look at the problem.

@liyafan82
Contributor Author

When I use the changes and try to compress and decompress an empty buffer (by using a variable sized vector with only missing values) I get a SIGSEGV […]

@HedgehogCode The problem happened when lz4-java tried to decompress an empty buffer. I have fixed it by special-casing empty buffers. Thanks again for your kind reminder.
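The shape of the fix can be sketched roughly like this (a minimal sketch using plain byte arrays and `java.util.zip` as a stand-in for `ArrowBuf` and lz4-java; the `EmptyGuard` class and method names are hypothetical):

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class EmptyGuard {
  /**
   * Decompress, but special-case empty payloads instead of handing a
   * zero-length buffer to the native decompressor (which is what
   * triggered the SIGSEGV in the report above).
   */
  static byte[] decompress(byte[] compressed, int uncompressedLength) {
    if (uncompressedLength == 0) {
      return new byte[0];  // nothing to decompress
    }
    Inflater inflater = new Inflater();  // zlib stands in for LZ4 here
    inflater.setInput(compressed);
    byte[] out = new byte[uncompressedLength];
    try {
      int n = inflater.inflate(out);
      return Arrays.copyOf(out, n);
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt compressed data", e);
    } finally {
      inflater.end();
    }
  }
}
```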

}

ByteBuffer compressed = MemoryUtil.directBuffer(
compressedBuffer.memoryAddress() + SIZE_OF_MESSAGE_LENGTH, (int) compressedBuffer.writerIndex());
Contributor

nit: the capacity may be (int) (compressedBuffer.writerIndex() - SIZE_OF_MESSAGE_LENGTH)?

Contributor Author

Nice catch. Thank you @stczwd
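The capacity fix can be illustrated with a plain `ByteBuffer` slice (a sketch with hypothetical names; the PR itself operates on `ArrowBuf` memory addresses via `MemoryUtil.directBuffer`):

```java
import java.nio.ByteBuffer;

public class SliceBody {
  static final int SIZE_OF_MESSAGE_LENGTH = 8;

  /**
   * View only the compressed body: start after the 8-byte length prefix,
   * and size it as writerIndex minus the prefix, not the whole writerIndex.
   */
  static ByteBuffer compressedBody(ByteBuffer buffer, int writerIndex) {
    ByteBuffer dup = buffer.duplicate();
    dup.position(SIZE_OF_MESSAGE_LENGTH);
    dup.limit(writerIndex);
    return dup.slice();  // capacity == writerIndex - SIZE_OF_MESSAGE_LENGTH
  }
}
```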

@HedgehogCode
Contributor

The comment in the BodyCompression protobuf states:

Each constituent buffer is first compressed with the indicated compressor, and then written with the uncompressed length in the first 8 bytes as a 64-bit little-endian signed integer followed by the compressed buffer bytes (and then padding as required by the protocol). The uncompressed length may be set to -1 to indicate that the data that follows is not compressed, which can be useful for cases where compression does not yield appreciable savings.

Should the check for a length of -1 be made outside of CompressionCodec implementation? I think it would be useful to do it in the #decompress(BufferAllocator, ArrowBuf) method.

Should be pretty easy if I don't miss something:

    if (decompressedLength == -1L) {
      // handle uncompressed buffers
      return compressedBuffer.slice(SIZE_OF_MESSAGE_LENGTH,
          compressedBuffer.writerIndex() - SIZE_OF_MESSAGE_LENGTH);
    }

@liyafan82
Contributor Author

Should the check for a length of -1 be made outside of CompressionCodec implementation? I think it would be useful to do it in the #decompress(BufferAllocator, ArrowBuf) method. […]

@HedgehogCode Thanks for your good suggestion. I have revised the code so that when the compressed buffer would be larger than the raw one, we directly send the raw buffer with a length of -1. In addition, I have updated the test case to make sure this code path is covered.
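The compress-side fallback can be sketched with plain byte arrays (a minimal sketch: `java.util.zip` stands in for LZ4, and the class and constant names are hypothetical; the 8-byte little-endian prefix and the -1 sentinel follow the BodyCompression comment quoted above):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.zip.Deflater;

public class CompressFallback {
  static final int PREFIX = 8;
  static final long NO_COMPRESSION = -1L;

  /**
   * Compress, but fall back to the raw bytes with a -1 length prefix
   * when compression does not actually shrink the buffer.
   */
  static byte[] compress(byte[] raw) {
    Deflater deflater = new Deflater();
    deflater.setInput(raw);
    deflater.finish();
    byte[] tmp = new byte[raw.length + 64];
    int n = deflater.deflate(tmp);
    deflater.end();

    boolean keepRaw = n >= raw.length;  // compression did not help
    byte[] body = keepRaw ? raw : Arrays.copyOf(tmp, n);
    ByteBuffer out = ByteBuffer.allocate(PREFIX + body.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    out.putLong(keepRaw ? NO_COMPRESSION : raw.length);
    out.put(body);
    return out.array();
  }
}
```

A reader can then branch on the prefix: a non-negative value means "decompress the body to this length", while -1 means "the body is already the raw data".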

@emkornfield
Contributor

@liyafan82 per the recent discussion on the mailing list: I looked into it, and the lz4 page mentioned https://commons.apache.org/proper/commons-compress/javadocs/api-release/org/apache/commons/compress/compressors/lz4/package-summary.html as a port, so that might offer better compatibility as a library

@liyafan82
Contributor Author

@liyafan82 per the recent discussion on the mailing list: I looked into it, and the lz4 page mentioned https://commons.apache.org/proper/commons-compress/javadocs/api-release/org/apache/commons/compress/compressors/lz4/package-summary.html as a port, so that might offer better compatibility as a library

@emkornfield Sounds reasonable. I will update the PR accordingly. Thanks for your good suggestion.

@pitrou
Member

pitrou commented Feb 3, 2021

See PR #9408 for integration tests.

@liyafan82
Contributor Author

Switched to the commons-compress library, according to @emkornfield's suggestion.

@emkornfield
Contributor

@liyafan82 could you enable the java integration test to confirm that reading the files generated by C++ works before we merge (once we verify it is working I can take a final look)

@@ -74,6 +74,11 @@
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
Contributor

I'm a little hesitant to take a direct dependency on any lz4 library. Is there a way this can be done optionally (similar to how the netty dependency for memory has been isolated)?

Contributor Author

@emkornfield Sounds reasonable. I will try to revise the PR accordingly. Thanks for your good suggestion.

@liyafan82
Contributor Author

@liyafan82 could you enable the java integration test to confirm that reading the files generated by C++ works before we merge (once we verify it is working I can take a final look)

Sure. I will do some tests for that.

@emkornfield
Contributor

…before we merge (once we verify it is working I can take a final look)

Sure. I will do some tests for that.

To run tests it should be sufficient to unskip the Java implementation in archery.

@liyafan82
Contributor Author

To avoid the direct dependency on the lz4 library, I have extracted the concrete compression codec implementations to a separate module. Will continue to work on the integration tests.

@emkornfield
Contributor

@emkornfield Another (maybe ugly) solution that comes to my mind is to use different libraries for buffers with dependent and independent blocks?

No, I think we should figure out a way to have one implementation.

@emkornfield
Contributor

@liyafan82 let me know when you think this is ready for re-review. Like I said, I think getting a baseline working so we can do the follow-up work makes the most sense here.

@liyafan82
Contributor Author

@liyafan82 let me know when you think this is ready for re-review. Like I said, I think getting a baseline working so we can do the follow-up work makes the most sense here.

@emkornfield Sorry for my delay. I am a little busy these days. I will try my best to make it ready in one or two days.

@emkornfield
Contributor

No rush, just wanted to make sure I knew when it was ready for another pass.

@liyafan82 liyafan82 force-pushed the fly_1211_comp branch 8 times, most recently from 8aab6b5 to b28986c Compare March 11, 2021 09:15
@liyafan82
Contributor Author

@emkornfield I have replied to each of the previous comments. So maybe it is ready for a new round of review. Thanks.

Contributor

@emkornfield emkornfield left a comment

Thanks @liyafan82, a few more minor comments. I'd like to see this merged sooner rather than later so we can do the follow-up work. If you don't have bandwidth please let me know, and if it is OK I can fix up my comments and push to this PR?

<version>1.20</version>
</dependency>
<dependency>
<groupId>io.netty</groupId>
Contributor

hmm, wonder why netty is required here though, I'll take a closer look.

}

/**
* Process decompression by decompressing the buffer as is.
Contributor

please update the docs to match, something like:

"Slice the buffer to contain the uncompressed bytes"

import org.apache.commons.compress.compressors.lz4.FramedLZ4CompressorOutputStream;
import org.apache.commons.compress.utils.IOUtils;

import io.netty.util.internal.PlatformDependent;
Contributor

ahh this is where netty is used. we don't have an arrow wrapper for it?

Contributor Author

I guess no for now. Maybe we can have one in the future, so we can remove the dependency on Netty (and other dependencies on Netty as well).


@Override
public String getCodecName() {
return CompressionType.name(CompressionType.LZ4_FRAME);
Contributor

With the new enum, maybe we can make this an accessor that returns an enum instead? And then the byte can be extracted from there where necessary?

@liyafan82
Contributor Author

Thanks @liyafan82, a few more minor comments. I'd like to see this merged sooner rather than later so we can do the follow-up work. If you don't have bandwidth please let me know, and if it is OK I can fix up my comments and push to this PR?

@emkornfield Thanks a lot for the further comments. I think I can fix them up today.

@liyafan82
Contributor Author

With the new enum, maybe we can make this an accessor that returns an enum instead? And then the byte can be extracted from there where necessary?

Sounds good. I have revised the code accordingly.
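The accessor change can be sketched like this (an illustrative sketch: the `CodecTypes` wrapper and constructor wiring are hypothetical, with the byte values loosely following the Arrow `CompressionType` flatbuffer ids):

```java
/**
 * Illustrative sketch: expose the codec as an enum and derive the
 * on-the-wire byte from it, instead of returning a raw name string.
 */
public class CodecTypes {
  enum CodecType {
    NO_COMPRESSION((byte) -1),
    LZ4_FRAME((byte) 0),
    ZSTD((byte) 1);

    private final byte type;

    CodecType(byte type) {
      this.type = type;
    }

    byte getType() {
      return type;  // extracted where necessary, e.g. when writing IPC metadata
    }
  }
}
```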

@liyafan82
Contributor Author

please update the docs to match, something like:

"Slice the buffer to contain the uncompressed bytes"

Updated. Thank you.

@emkornfield
Contributor

+1 thank you. @liyafan82 did you have plans to work on the follow-up items or ZSTD? Otherwise I can take them up.

@HedgehogCode any thoughts on how to proceed for LZ4? We can maybe discuss more on the performance JIRA?

@liyafan82
Contributor Author

+1 thank you. @liyafan82 did you have plans to work on the follow-up items or ZSTD? Otherwise I can take them up.

@HedgehogCode any thoughts on how to proceed for LZ4? We can maybe discuss more on the performance JIRA?

@emkornfield Thanks a lot for your effort.

I have started working on ARROW-11899 yesterday.
If you are interested in any of the items (including ARROW-11899), please feel free to assign them to yourself. I'd like to help with the review/discussions :-)

@emkornfield
Contributor

If you've already started ARROW-11899 then I'll let you finish it up; hopefully it isn't too much work. We are discussing the path forward for LZ4 in general on the ML; once that is cleared up we can figure out the work, including whether @HedgehogCode is interested in contributing.

@liyafan82
Contributor Author

If you've already started ARROW-11899 then I'll let you finish it up; hopefully it isn't too much work. We are discussing the path forward for LZ4 in general on the ML; once that is cleared up we can figure out the work, including whether @HedgehogCode is interested in contributing.

Sounds good. Hopefully, I will prepare a PR in a few days.

@pitrou
Member

pitrou commented Mar 22, 2021

@liyafan82 @emkornfield Can one of you update https://github.com/apache/arrow/blob/master/docs/source/status.rst#ipc-format once this is all finished?

@liyafan82
Contributor Author

@liyafan82 @emkornfield Can one of you update https://github.com/apache/arrow/blob/master/docs/source/status.rst#ipc-format once this is all finished?

@pitrou I will keep this in mind. Thanks for your kind reminder.

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Support compressing/decompressing RecordBatch IPC buffers by LZ4.

Closes apache#8949 from liyafan82/fly_1211_comp

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Support compressing/decompressing RecordBatch IPC buffers by LZ4.

Closes apache#8949 from liyafan82/fly_1211_comp

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>